Assignment 2/ Group 5¶

Section 1: DATA DESCRIPTION¶

1.1 Dataset: 'price-demand'¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
In [2]:
pd.set_option('display.max_columns', 40)
In [3]:
# load the two csv files as dataframe
price_demand = pd.read_csv('price_demand_data.csv')
weather = pd.read_csv('weather_data.csv')
In [4]:
price_demand.head()
Out[4]:
REGION SETTLEMENTDATE TOTALDEMAND PRICECATEGORY
0 VIC1 1/01/2021 0:30 4179.21 LOW
1 VIC1 1/01/2021 1:00 4047.76 LOW
2 VIC1 1/01/2021 1:30 3934.70 LOW
3 VIC1 1/01/2021 2:00 3766.45 LOW
4 VIC1 1/01/2021 2:30 3590.37 LOW
In [5]:
from IPython.display import display, Markdown
#check dimension
rows, columns = price_demand.shape

data = {'Rows': [rows], 'Columns': [columns]}
dimensions_table = pd.DataFrame(data)

display(Markdown("#### Dimensions of price_demand"))
display(dimensions_table)

Dimensions of price_demand¶

Rows Columns
0 11664 4
In [6]:
# check variables type
dtypes_pd = pd.DataFrame(price_demand.dtypes, columns=['Type'])

display(Markdown("#### Data types of price_demand"))
display(dtypes_pd)

Data types of price_demand¶

Type
REGION object
SETTLEMENTDATE object
TOTALDEMAND float64
PRICECATEGORY object
In [7]:
# statistical summary
description = price_demand.describe()


caption = "#### Summary of 'TOTALDEMAND'"
display(Markdown(f"{caption}"))
display(description)

Summary of 'TOTALDEMAND'¶

TOTALDEMAND
count 11664.000000
mean 4925.798454
std 876.407490
min 2708.530000
25% 4255.500000
50% 4803.755000
75% 5477.337500
max 8196.830000
In [8]:
#check missing values
price_demand_missing = price_demand.isna().sum()


missingv = pd.DataFrame(price_demand_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})

display(Markdown("#### Missing Values in price_demand"))
display(missingv)

Missing Values in price_demand¶

Column Missing Values
0 REGION 0
1 SETTLEMENTDATE 0
2 TOTALDEMAND 0
3 PRICECATEGORY 0

Data Introduction:¶

The 'price_demand' dataset comprises energy price and demand figures for the state of Victoria, Australia, during the period between January and August 2021. This information has been sourced from the Australian Energy Market Operator (AEMO) and provides valuable insights into the region's energy market dynamics.

Dataset Specifications:¶

The dataset contains 11,664 observations and 4 variables, captured at 30-minute intervals from 01/01/2021 to 01/09/2021.

1 REGION: the state to which the observation belongs (categorical).

2 SETTLEMENTDATE: the date and time of the observation, stored as a string (object).

3 TOTALDEMAND: the energy demand per 30-minute interval (numerical).

4 PRICECATEGORY: a categorical description of the energy price level (categorical).

Data Quality:¶

The 'price_demand' dataset is complete and contains no missing values, ensuring a high level of data integrity and reliability for analysis purposes.

1.2 Dataset: 'weather'¶

In [9]:
weather.head()
Out[9]:
Date Minimum temperature (°C) Maximum temperature (°C) Rainfall (mm) Evaporation (mm) Sunshine (hours) Direction of maximum wind gust Speed of maximum wind gust (km/h) Time of maximum wind gust 9am Temperature (°C) 9am relative humidity (%) 9am cloud amount (oktas) 9am wind direction 9am wind speed (km/h) 9am MSL pressure (hPa) 3pm Temperature (°C) 3pm relative humidity (%) 3pm cloud amount (oktas) 3pm wind direction 3pm wind speed (km/h) 3pm MSL pressure (hPa)
0 1/01/2021 15.6 29.9 0.0 2.8 9.3 NNE 31.0 13:14 19.2 77.0 6 N 2 1018.8 28.1 43 5.0 E 13 1015.3
1 2/01/2021 18.4 29.0 0.0 9.4 1.3 NNW 30.0 8:22 23.3 52.0 7 NNW 17 1013.3 28.7 38 7.0 SW 4 1008.5
2 3/01/2021 17.0 26.2 12.6 4.8 7.1 WSW 33.0 17:55 18.3 100.0 8 WSW 4 1007.7 23.5 59 4.0 SSW 2 1005.2
3 4/01/2021 16.0 18.6 2.6 3.8 0.0 SSE 41.0 16:03 16.2 98.0 8 SSE 11 1010.0 18.2 82 8.0 SSW 17 1011.0
4 5/01/2021 15.9 19.1 11.2 1.0 0.0 SSE 35.0 11:02 17.2 96.0 8 SSE 13 1012.5 18.2 82 8.0 SSE 19 1013.3
In [10]:
#check dimension
rows, columns = weather.shape

data1 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table1 = pd.DataFrame(data1)

display(Markdown("#### Dimensions of weather"))
display(dimensions_table1)

Dimensions of weather¶

Rows Columns
0 243 21
In [11]:
# statistical summary
weather.describe()
Out[11]:
Minimum temperature (°C) Maximum temperature (°C) Rainfall (mm) Evaporation (mm) Sunshine (hours) Speed of maximum wind gust (km/h) 9am Temperature (°C) 9am relative humidity (%) 9am cloud amount (oktas) 9am MSL pressure (hPa) 3pm Temperature (°C) 3pm relative humidity (%) 3pm cloud amount (oktas) 3pm MSL pressure (hPa)
count 242.000000 242.000000 241.000000 243.000000 243.000000 240.000000 242.000000 242.000000 243.000000 241.000000 243.000000 243.000000 242.000000 242.000000
mean 11.050826 19.445868 1.576763 3.902469 5.349383 34.412500 13.720661 74.454545 5.164609 1017.740664 18.040329 56.930041 5.301653 1015.824793
std 3.870242 5.354085 4.498754 2.702141 3.604902 10.909319 4.306618 14.177593 2.562778 7.683402 4.963547 14.017376 2.392051 7.435859
min 1.700000 10.600000 0.000000 0.000000 0.000000 15.000000 3.000000 25.000000 0.000000 989.700000 8.600000 21.000000 0.000000 989.000000
25% 8.100000 15.500000 0.000000 1.900000 2.150000 28.000000 10.925000 65.000000 3.000000 1012.800000 14.400000 48.000000 3.000000 1011.000000
50% 10.900000 18.300000 0.000000 3.200000 4.900000 33.000000 13.400000 75.000000 7.000000 1018.100000 17.100000 56.000000 7.000000 1015.750000
75% 13.800000 21.800000 0.600000 5.600000 8.350000 41.000000 16.400000 84.000000 7.000000 1023.700000 20.150000 66.000000 7.000000 1021.600000
max 22.200000 39.200000 43.200000 13.800000 13.100000 67.000000 30.900000 100.000000 8.000000 1034.200000 35.200000 98.000000 8.000000 1032.400000
In [12]:
#check data type


dtypes_w = pd.DataFrame(weather.dtypes, columns=['Type'])
dtypes_w1=dtypes_w.T
display(Markdown("#### Data types of weather"))
display(dtypes_w1)

Data types of weather¶

Date Minimum temperature (°C) Maximum temperature (°C) Rainfall (mm) Evaporation (mm) Sunshine (hours) Direction of maximum wind gust Speed of maximum wind gust (km/h) Time of maximum wind gust 9am Temperature (°C) 9am relative humidity (%) 9am cloud amount (oktas) 9am wind direction 9am wind speed (km/h) 9am MSL pressure (hPa) 3pm Temperature (°C) 3pm relative humidity (%) 3pm cloud amount (oktas) 3pm wind direction 3pm wind speed (km/h) 3pm MSL pressure (hPa)
Type object float64 float64 float64 float64 float64 object float64 object float64 float64 int64 object object float64 float64 int64 float64 object object float64
In [13]:
#check missing values
weather_missing = weather.isna().sum()


missingv1 = pd.DataFrame(weather_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})

display(Markdown("#### Missing Values in weather"))
display(missingv1)

Missing Values in weather¶

Column Missing Values
0 Date 0
1 Minimum temperature (°C) 1
2 Maximum temperature (°C) 1
3 Rainfall (mm) 2
4 Evaporation (mm) 0
5 Sunshine (hours) 0
6 Direction of maximum wind gust 3
7 Speed of maximum wind gust (km/h) 3
8 Time of maximum wind gust 3
9 9am Temperature (°C) 1
10 9am relative humidity (%) 1
11 9am cloud amount (oktas) 0
12 9am wind direction 1
13 9am wind speed (km/h) 1
14 9am MSL pressure (hPa) 2
15 3pm Temperature (°C) 0
16 3pm relative humidity (%) 0
17 3pm cloud amount (oktas) 1
18 3pm wind direction 0
19 3pm wind speed (km/h) 0
20 3pm MSL pressure (hPa) 1
In [14]:
# display details of missing values
total_cells_missing = weather.isnull().sum().sum()
total_rows_missing = weather.isnull().any(axis=1).sum()
total_cols_missing = weather.isnull().any(axis=0).sum()


data = {"Total cells with missing values": [total_cells_missing],
        "Total rows with missing values": [total_rows_missing],
        "Total columns with missing values": [total_cols_missing],}

missing_data_summary = pd.DataFrame(data)



display(Markdown("#### Missing data summary"))
display(missing_data_summary)

Missing data summary¶

Total cells with missing values Total rows with missing values Total columns with missing values
0 21 6 13

Data Introduction:¶

The 'weather' dataset comprises key weather indicators for the city of Melbourne, Australia, during the period between January and August 2021. This information has been sourced from the Bureau of Meteorology and provides valuable insights into the region's weather patterns and climate.

Dataset Specifications:¶

The dataset contains 243 observations and 21 variables, captured daily from 01/01/2021 to 31/08/2021.

Here is a summary of the variable types:

Float (float64): 12 variables

Integer (int64): 2 variables

String (object): 7 variables, including 'Date', which is stored as a string until converted to datetime64
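The tally above can be checked mechanically from the `dtypes` attribute; a minimal sketch on a toy frame (the same two lines apply to the real `weather` dataframe):

```python
import pandas as pd

# Toy frame mirroring how the weather CSV loads: text columns arrive as
# 'object', decimal columns as 'float64', whole-number columns as 'int64'
df = pd.DataFrame({
    "Date": ["1/01/2021"],               # a string until cast to datetime64
    "Rainfall (mm)": [0.0],
    "9am cloud amount (oktas)": [6],
})

# count how many columns fall under each dtype
counts = df.dtypes.astype(str).value_counts().to_dict()
print(counts)  # each dtype appears once in this toy frame
```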

Data Quality:¶

The 'weather' dataset contains some missing values, with a total of 21 cells having missing values across 6 rows and 13 columns.

Section 2: DATA CLEANING¶

2.1 Dataset: 'price-demand'¶

In [15]:
# split the SETTLEMENTDATE into two columns, one is for the date and the other is for the time
price_demand[['DATE','TIME']] = price_demand['SETTLEMENTDATE'].str.split(' ', expand=True)
price_demand.head()

#cast the new date column to datetime format
price_demand['DATE'] = pd.to_datetime(price_demand['DATE'], format='%d/%m/%Y')
date_counts = price_demand['DATE'].value_counts()
In [16]:
#display modified data (take a copy to avoid aliasing the original frame)
price_demand_1 = price_demand.copy()
price_demand_1.head()
Out[16]:
REGION SETTLEMENTDATE TOTALDEMAND PRICECATEGORY DATE TIME
0 VIC1 1/01/2021 0:30 4179.21 LOW 2021-01-01 0:30
1 VIC1 1/01/2021 1:00 4047.76 LOW 2021-01-01 1:00
2 VIC1 1/01/2021 1:30 3934.70 LOW 2021-01-01 1:30
3 VIC1 1/01/2021 2:00 3766.45 LOW 2021-01-01 2:00
4 VIC1 1/01/2021 2:30 3590.37 LOW 2021-01-01 2:30
In [17]:
# drop variables
price_demand_2 = price_demand_1.drop(columns=['SETTLEMENTDATE', 'REGION', 'TIME'])

# Print the modified data
price_demand_2.head()
Out[17]:
TOTALDEMAND PRICECATEGORY DATE
0 4179.21 LOW 2021-01-01
1 4047.76 LOW 2021-01-01
2 3934.70 LOW 2021-01-01
3 3766.45 LOW 2021-01-01
4 3590.37 LOW 2021-01-01
In [18]:
#create new variables
grouped_pd = price_demand_1.groupby(['DATE', 'PRICECATEGORY']).size().unstack(fill_value=0)
grouped_pd.columns = [f"number of {col.lower()}" for col in grouped_pd.columns]

# reset the index to merge the data back into the original DataFrame
grouped_pd.reset_index(inplace=True)

# merge the grouped data back into the original DataFrame
price_demand_3 = price_demand_2.merge(grouped_pd, on='DATE')
price_demand_3.head()
Out[18]:
TOTALDEMAND PRICECATEGORY DATE number of extreme number of high number of low number of medium
0 4179.21 LOW 2021-01-01 0 0 47 0
1 4047.76 LOW 2021-01-01 0 0 47 0
2 3934.70 LOW 2021-01-01 0 0 47 0
3 3766.45 LOW 2021-01-01 0 0 47 0
4 3590.37 LOW 2021-01-01 0 0 47 0
In [19]:
# Group by DATE and compute the sum of each day
daily_sum = price_demand_3.groupby('DATE').sum().reset_index()
daily_sum = daily_sum[['DATE', 'TOTALDEMAND']]
daily_sum.columns = ['DATE', 'Daily use']
price_demand_4 = price_demand_3.merge(daily_sum, on='DATE')
price_demand_4.head()
Out[19]:
TOTALDEMAND PRICECATEGORY DATE number of extreme number of high number of low number of medium Daily use
0 4179.21 LOW 2021-01-01 0 0 47 0 185853.37
1 4047.76 LOW 2021-01-01 0 0 47 0 185853.37
2 3934.70 LOW 2021-01-01 0 0 47 0 185853.37
3 3766.45 LOW 2021-01-01 0 0 47 0 185853.37
4 3590.37 LOW 2021-01-01 0 0 47 0 185853.37
In [20]:
#drop variables
price_demand_4 = price_demand_4.drop(columns=['PRICECATEGORY', 'TOTALDEMAND'])

# Print the modified data
price_demand_4.head()
Out[20]:
DATE number of extreme number of high number of low number of medium Daily use
0 2021-01-01 0 0 47 0 185853.37
1 2021-01-01 0 0 47 0 185853.37
2 2021-01-01 0 0 47 0 185853.37
3 2021-01-01 0 0 47 0 185853.37
4 2021-01-01 0 0 47 0 185853.37
In [21]:
#drop duplicated observations
price_demand_5 = price_demand_4.drop_duplicates(subset=["DATE"])

# Print the DataFrame with unique rows
price_demand_5.tail()
Out[21]:
DATE number of extreme number of high number of low number of medium Daily use
11471 2021-08-28 0 0 19 29 209104.93
11519 2021-08-29 2 3 19 24 224449.35
11567 2021-08-30 0 1 26 21 232158.40
11615 2021-08-31 0 0 31 17 226540.29
11663 2021-09-01 0 0 0 1 4811.27
In [22]:
price_demand_6 = price_demand_5.copy()
# create new column 'month'
price_demand_6['month'] = price_demand_6['DATE'].dt.month

price_demand_6.head()
Out[22]:
DATE number of extreme number of high number of low number of medium Daily use month
0 2021-01-01 0 0 47 0 185853.37 1
47 2021-01-02 0 0 48 0 197990.13 1
95 2021-01-03 0 0 48 0 188742.96 1
143 2021-01-04 0 0 48 0 199281.07 1
191 2021-01-05 0 0 48 0 207680.91 1
In [23]:
def computing_day(day):
    if day.weekday() < 5:
        return 'weekday'
    else:
        return 'weekend'


#create new column 'type of day'
price_demand_7 = price_demand_6.copy()

price_demand_7['type of day'] =price_demand_7['DATE'].apply(computing_day)

price_demand_7.head()
Out[23]:
DATE number of extreme number of high number of low number of medium Daily use month type of day
0 2021-01-01 0 0 47 0 185853.37 1 weekday
47 2021-01-02 0 0 48 0 197990.13 1 weekend
95 2021-01-03 0 0 48 0 188742.96 1 weekend
143 2021-01-04 0 0 48 0 199281.07 1 weekday
191 2021-01-05 0 0 48 0 207680.91 1 weekday
In [24]:
#create new column 'weekday'
price_demand_8 = price_demand_7.copy()

dummy_vars = pd.get_dummies(price_demand_8['type of day'], prefix='', prefix_sep='')

# Concatenate the original DataFrame and the dummy variables
price_demand_8 = pd.concat([price_demand_8, dummy_vars], axis=1)

# Drop the original 'type_of_day' column
price_demand_8 = price_demand_8.drop('type of day', axis=1)
price_demand_8 = price_demand_8.drop('weekend', axis=1)
price_demand_8.head()
Out[24]:
DATE number of extreme number of high number of low number of medium Daily use month weekday
0 2021-01-01 0 0 47 0 185853.37 1 1
47 2021-01-02 0 0 48 0 197990.13 1 0
95 2021-01-03 0 0 48 0 188742.96 1 0
143 2021-01-04 0 0 48 0 199281.07 1 1
191 2021-01-05 0 0 48 0 207680.91 1 1
In [25]:
#remove observations for particular dates
price_demand_8 = price_demand_8[price_demand_8['DATE'] != '2021-09-01']
price_demand_8 = price_demand_8[price_demand_8['DATE'] != '2021-01-01']
In [26]:
#create new columns related to season
price_demand_9 = price_demand_8.copy()
price_demand_9["SEASON"] = price_demand_9["DATE"].dt.month.apply(lambda x: "Summer" if x in [1, 2, 12] else "Autumn" if x in [3, 4, 5] else "Winter")

# Convert the season columns to 1s and 0s
season_cols = [ "Summer", "Autumn", "Winter"]

price_demand_9['SEASON'].value_counts()
for col in season_cols:
    price_demand_9[col] = price_demand_9["SEASON"].apply(lambda x: 1 if x == col else 0)

price_demand_9.drop('SEASON',inplace=True,axis=1)
price_demand_9.head(10)
Out[26]:
DATE number of extreme number of high number of low number of medium Daily use month weekday Summer Autumn Winter
47 2021-01-02 0 0 48 0 197990.13 1 0 1 0 0
95 2021-01-03 0 0 48 0 188742.96 1 0 1 0 0
143 2021-01-04 0 0 48 0 199281.07 1 1 1 0 0
191 2021-01-05 0 0 48 0 207680.91 1 1 1 0 0
239 2021-01-06 0 0 48 0 201497.24 1 1 1 0 0
287 2021-01-07 0 0 48 0 201345.90 1 1 1 0 0
335 2021-01-08 0 0 46 2 207526.40 1 1 1 0 0
383 2021-01-09 0 0 42 6 212630.42 1 0 1 0 0
431 2021-01-10 0 0 34 14 230588.61 1 0 1 0 0
479 2021-01-11 0 0 37 11 290620.38 1 1 1 0 0
In [27]:
#check dimension
rows, columns = price_demand_9.shape

data = {'Rows': [rows], 'Columns': [columns]}
dimensions_table = pd.DataFrame(data)

display(Markdown("#### Dimensions of updated price_demand "))
display(dimensions_table)

Dimensions of updated price_demand¶

Rows Columns
0 242 11

To ensure the data's quality and suitability for analysis, we undertook the following steps to clean and preprocess it:¶

1 Split 'SETTLEMENTDATE' into 'DATE' and 'TIME' columns.

2 Converted 'DATE' to datetime format.

3 Removed the 'SETTLEMENTDATE' and 'TIME' columns, as our aim was to restructure the data into a daily format.

4 Removed the 'REGION' variable, as it is irrelevant for the purposes of our analysis.

5 Added columns counting the price categories per day: 'number of extreme', 'number of high', 'number of low', and 'number of medium'. This transformation enables more efficient analysis and interpretation.

6 Calculated a 'Daily use' column, which sums the 'TOTALDEMAND' values for each day.

7 Removed the 'PRICECATEGORY' and 'TOTALDEMAND' columns, completing the conversion to a daily format.

8 Added a 'month' variable extracted from the 'DATE' column.

9 Added a 'type of day' variable ('weekday' or 'weekend') derived from the 'DATE' column.

10 Added a 'weekday' dummy variable and removed the 'type of day' column.

11 Removed the rows for 01/01/2021 and 01/09/2021, which have incomplete data.

12 Added seasonal dummy variables 'Summer', 'Autumn', and 'Winter' derived from the month ('Spring' is absent because the data end in August).

Now the dataset has 242 observations and 11 variables.
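For reference, the twelve steps can be collapsed into a single helper; a sketch under the assumption that the input frame has the four original columns (the function name `daily_features` is ours, not part of the notebook):

```python
import pandas as pd

def daily_features(df):
    """Condense 30-minute price/demand rows into one row per day."""
    out = df.copy()
    # steps 1-2: split the timestamp and parse the date part
    out[["DATE", "TIME"]] = out["SETTLEMENTDATE"].str.split(" ", expand=True)
    out["DATE"] = pd.to_datetime(out["DATE"], format="%d/%m/%Y")

    # step 5: count each price category per day
    counts = out.groupby(["DATE", "PRICECATEGORY"]).size().unstack(fill_value=0)
    counts.columns = [f"number of {c.lower()}" for c in counts.columns]

    # step 6: total demand per day
    daily = out.groupby("DATE")["TOTALDEMAND"].sum().rename("Daily use")

    res = counts.join(daily).reset_index()
    # steps 8-10: month and weekday dummy
    res["month"] = res["DATE"].dt.month
    res["weekday"] = (res["DATE"].dt.weekday < 5).astype(int)

    # step 11: drop the boundary dates with incomplete coverage
    cut = pd.to_datetime(["2021-01-01", "2021-09-01"])
    res = res[~res["DATE"].isin(cut)].copy()

    # step 12: season dummies (no 'Spring': the data end in August)
    season = res["month"].map(lambda m: "Summer" if m in (1, 2, 12)
                              else "Autumn" if m in (3, 4, 5) else "Winter")
    for s in ("Summer", "Autumn", "Winter"):
        res[s] = (season == s).astype(int)
    return res
```

Applied to the raw `price_demand` frame, this reproduces the 242 x 11 result above in one call.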

2.2 Dataset: 'weather'¶

In [28]:
weather_1 = weather.copy()
In [29]:
# convert the format of observation 'Date'
weather_1['Date'] = pd.to_datetime(weather_1['Date'], format='%d/%m/%Y')
In [30]:
#remove observations for particular dates
weather_1 = weather_1[weather_1['Date'] != '2021-09-01']
weather_1 = weather_1[weather_1['Date'] != '2021-01-01']

To guarantee the quality and appropriateness of the data for analysis, we implemented the following steps to clean and preprocess the information:¶

1 Converted the 'Date' column to a datetime64 format.

2 Removed the rows for 01/01/2021 and 01/09/2021 to match the price_demand dataset.

2.3 Data merging¶

In [31]:
# merge data
merged = pd.merge(price_demand_9, weather_1, left_on='DATE', right_on='Date', how='inner')
display(Markdown("#### Merged Dataframe with Price_Demand and weather"))

# Displaying the table
display(merged.head(2))

Merged Dataframe with Price_Demand and weather¶

DATE number of extreme number of high number of low number of medium Daily use month weekday Summer Autumn Winter Date Minimum temperature (°C) Maximum temperature (°C) Rainfall (mm) Evaporation (mm) Sunshine (hours) Direction of maximum wind gust Speed of maximum wind gust (km/h) Time of maximum wind gust 9am Temperature (°C) 9am relative humidity (%) 9am cloud amount (oktas) 9am wind direction 9am wind speed (km/h) 9am MSL pressure (hPa) 3pm Temperature (°C) 3pm relative humidity (%) 3pm cloud amount (oktas) 3pm wind direction 3pm wind speed (km/h) 3pm MSL pressure (hPa)
0 2021-01-02 0 0 48 0 197990.13 1 0 1 0 0 2021-01-02 18.4 29.0 0.0 9.4 1.3 NNW 30.0 8:22 23.3 52.0 7 NNW 17 1013.3 28.7 38 7.0 SW 4 1008.5
1 2021-01-03 0 0 48 0 188742.96 1 0 1 0 0 2021-01-03 17.0 26.2 12.6 4.8 7.1 WSW 33.0 17:55 18.3 100.0 8 WSW 4 1007.7 23.5 59 4.0 SSW 2 1005.2
In [32]:
# drop column 'Date'
merged.drop('Date', axis=1, inplace=True)
In [33]:
#check dimension
rows, columns = merged.shape

data2 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table2 = pd.DataFrame(data2)

display(Markdown("#### Dimensions of 'merged'"))
display(dimensions_table2)

Dimensions of 'merged'¶

Rows Columns
0 242 31
In [34]:
# check missing values
merged_missing = merged.isna().sum()


missingv2 = pd.DataFrame(merged_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})

display(Markdown("#### Missing Values in 'merged' "))
display(missingv2)

Missing Values in 'merged'¶

Column Missing Values
0 DATE 0
1 number of extreme 0
2 number of high 0
3 number of low 0
4 number of medium 0
5 Daily use 0
6 month 0
7 weekday 0
8 Summer 0
9 Autumn 0
10 Winter 0
11 Minimum temperature (°C) 1
12 Maximum temperature (°C) 1
13 Rainfall (mm) 2
14 Evaporation (mm) 0
15 Sunshine (hours) 0
16 Direction of maximum wind gust 3
17 Speed of maximum wind gust (km/h) 3
18 Time of maximum wind gust 3
19 9am Temperature (°C) 1
20 9am relative humidity (%) 1
21 9am cloud amount (oktas) 0
22 9am wind direction 1
23 9am wind speed (km/h) 1
24 9am MSL pressure (hPa) 2
25 3pm Temperature (°C) 0
26 3pm relative humidity (%) 0
27 3pm cloud amount (oktas) 1
28 3pm wind direction 0
29 3pm wind speed (km/h) 0
30 3pm MSL pressure (hPa) 1
In [35]:
# only 6 rows contain missing values, so we simply drop them to handle the missing data
merged.dropna(inplace=True)
In [36]:
# continue cleaning: rename the columns consistently (capitalize the first letter of each word)
merged.columns = merged.columns.str.title()
display(merged.head(2))
Date Number Of Extreme Number Of High Number Of Low Number Of Medium Daily Use Month Weekday Summer Autumn Winter Minimum Temperature (°C) Maximum Temperature (°C) Rainfall (Mm) Evaporation (Mm) Sunshine (Hours) Direction Of Maximum Wind Gust Speed Of Maximum Wind Gust (Km/H) Time Of Maximum Wind Gust 9Am Temperature (°C) 9Am Relative Humidity (%) 9Am Cloud Amount (Oktas) 9Am Wind Direction 9Am Wind Speed (Km/H) 9Am Msl Pressure (Hpa) 3Pm Temperature (°C) 3Pm Relative Humidity (%) 3Pm Cloud Amount (Oktas) 3Pm Wind Direction 3Pm Wind Speed (Km/H) 3Pm Msl Pressure (Hpa)
0 2021-01-02 0 0 48 0 197990.13 1 0 1 0 0 18.4 29.0 0.0 9.4 1.3 NNW 30.0 8:22 23.3 52.0 7 NNW 17 1013.3 28.7 38 7.0 SW 4 1008.5
1 2021-01-03 0 0 48 0 188742.96 1 0 1 0 0 17.0 26.2 12.6 4.8 7.1 WSW 33.0 17:55 18.3 100.0 8 WSW 4 1007.7 23.5 59 4.0 SSW 2 1005.2
In [37]:
#check data type
dtypes_m = pd.DataFrame(merged.dtypes, columns=['Type'])
dtypes_m1 = dtypes_m.T
display(Markdown("#### Data types of 'merged' "))
display(dtypes_m1)

Data types of 'merged'¶


We undertook the following steps to merge and preprocess two datasets:¶

1 Merged 'price_demand_9' and 'weather_1' datasets.

2 Removed the 'Date' variable, as it was redundant due to the presence of another variable containing identical information.

3 Addressed missing values. We initially tried moving-average and median/mode imputation, checking the estimates against known values; the estimates deviated substantially from the actual values because these methods are not sensitive to large changes, while energy usage fluctuates dramatically over time. Given that only six rows (less than 3%) contain missing values, we opted to simply remove them.

4 Renamed the columns consistently, capitalizing the first letter of each word.

After merging and cleaning, the 'merged' dataset has 236 observations and 31 variables.
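The imputation comparison in step 3 can be illustrated on a toy series (a sketch of the idea, not the notebook's original code): hide a value we actually know, impute it with a centred moving average, and measure the miss.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 55.0, 13.0, 11.0])   # toy series with a sharp spike
actual = s[2]

masked = s.copy()
masked[2] = np.nan                               # pretend the spike is missing

# moving-average imputation: mean of the surrounding 3-value window
window_mean = masked.rolling(3, center=True, min_periods=1).mean()
imputed = masked.fillna(window_mean)

error = abs(imputed[2] - actual)
print(imputed[2], error)  # 12.5 42.5 — the neighbours cannot see the spike
```

When the series swings sharply, as daily energy use does, the window mean badly underestimates such points, which is why dropping the six affected rows was preferred.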

Section 3: DATA EXPLORATION¶

3.1 Exploratory Analysis¶

In [38]:
import matplotlib.pyplot as plt
from collections import Counter


# count the value for column
counts = price_demand['PRICECATEGORY'].value_counts()

# create a bar plot
counts.plot(kind='bar')

# add labels and title
plt.xlabel('Price Category')
plt.ylabel('Count')
plt.title('Price Category Histogram')

# display the plot
plt.show()
In [39]:
import plotly.express as px



#plot line chart
fig1 = px.line(price_demand_9, x='DATE', y='Daily use',title='Daily Use by Date')


fig1.show()

From the line chart above, we have observed fluctuations over time, characterized by pronounced upward and downward trends. A relatively stable period of consumption was evident between March and June, whereas January, February, July, and August exhibited more substantial variations in energy usage.

In [40]:
import plotly.express as px


#draw scatter plot
fig2 = px.scatter(price_demand_7, x='DATE', y='Daily use', color='type of day',title='Daily Use by Date and Type of Day')
fig2.show()

Upon examination of the scatter plot provided, it is evident that there are distinct variations in energy usage between weekdays and weekends. Energy consumption generally demonstrates a decrease during weekends. Nonetheless, it is crucial to highlight two particular weekends, specifically on the dates 24/01 and 20/02, where an anomalously elevated level of energy usage was observed.

In [41]:
merged.drop(labels=['Date','Direction Of Maximum Wind Gust ', '9Am Wind Speed (Km/H)','3Pm Wind Speed (Km/H)', '9Am Wind Direction', '3Pm Wind Direction', 'Time Of Maximum Wind Gust'], inplace=True, axis=1)
In [42]:
import seaborn as sns
import matplotlib.pyplot as plt

#create scatter plot metric
x_cols = merged.columns.to_list()

plot_kws = {'color': 'black', 'alpha': 0.3}



sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[:4], plot_kws=plot_kws)

sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[5:9], plot_kws=plot_kws)

sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[9:13], plot_kws=plot_kws)

sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[13:17], plot_kws=plot_kws)

sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[17:21], plot_kws=plot_kws)

sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[21:24], plot_kws=plot_kws)

sns.set(font_scale=0.5)

display(Markdown("### Scatter Plot Matrix of Daily Use vs. Other Variables"))

plt.show()

Scatter Plot Matrix of Daily Use vs. Other Variables¶

It has been observed that the variables 'Minimum Temperature (°C)', 'Maximum Temperature (°C)', '9Am Temperature (°C)' and '3Pm Temperature (°C)' exhibit a distinctive V-shaped pattern with respect to the 'Daily Use' variable, indicating higher energy demand at extreme temperatures.

Note that in summer, energy demand rises with temperature, while in winter it increases as temperature falls, reflecting the seasons' impact on consumption patterns.

The relationship between temperature and energy consumption is therefore not linear: moderate temperatures reduce usage while extreme temperatures increase it. Multiple regression with polynomial terms can capture this non-linearity.

Furthermore, a clear positive linear correlation exists between 'Month' and energy consumption, but the relationships with other factors need more investigation.
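The non-linearity argument can be made concrete with synthetic data (illustration only, not the real dataset): a V/U-shaped demand curve defeats a straight line but is captured once a squared term is allowed.

```python
import numpy as np

temp = np.linspace(0, 40, 41)
demand = (temp - 20.0) ** 2 + 100.0      # lowest demand at a mild 20 °C

lin_coef = np.polyfit(temp, demand, 1)   # straight line
quad_coef = np.polyfit(temp, demand, 2)  # adds the squared term

lin_err = np.abs(np.polyval(lin_coef, temp) - demand).mean()
quad_err = np.abs(np.polyval(quad_coef, temp) - demand).mean()
print(quad_err < lin_err)  # True: the polynomial fit is far closer
```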

3.2 Initial feature selection¶

Our initial feature selection is to exclude the following variables:

1 'Date': Instead of using this variable, we incorporated 'month', 'weekday', and 'season' to better capture the relationship between time and daily energy usage.

2 '3Pm Wind Speed (Km/H)' and '9Am Wind Speed (Km/H)': We excluded these variables due to undefined string values, such as 'calm'.

3 'Direction Of Maximum Wind Gust', '9Am Wind Direction', '3Pm Wind Direction': With over 10 unique wind direction values, creating new columns for each is impractical, so we omitted these variables.

4 'Time Of Maximum Wind Gust': Our analysis indicated that this feature is unlikely to significantly influence daily energy usage, so we excluded it from the model.
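Had we wanted to retain the wind-speed columns instead, the stray strings could be coerced; a hypothetical sketch, assuming 'Calm' denotes near-zero wind:

```python
import pandas as pd

# Hypothetical alternative to dropping the columns: map 'Calm' to 0 km/h,
# then convert the remaining strings to numbers
speed = pd.Series(["17", "Calm", "4", "13"])
speed_kmh = pd.to_numeric(speed.replace("Calm", 0))
print(speed_kmh.tolist())  # [17, 0, 4, 13]
```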

In [43]:
#display the data
merged.head()
Out[43]:
Number Of Extreme Number Of High Number Of Low Number Of Medium Daily Use Month Weekday Summer Autumn Winter Minimum Temperature (°C) Maximum Temperature (°C) Rainfall (Mm) Evaporation (Mm) Sunshine (Hours) Speed Of Maximum Wind Gust (Km/H) 9Am Temperature (°C) 9Am Relative Humidity (%) 9Am Cloud Amount (Oktas) 9Am Msl Pressure (Hpa) 3Pm Temperature (°C) 3Pm Relative Humidity (%) 3Pm Cloud Amount (Oktas) 3Pm Msl Pressure (Hpa)
0 0 0 48 0 197990.13 1 0 1 0 0 18.4 29.0 0.0 9.4 1.3 30.0 23.3 52.0 7 1013.3 28.7 38 7.0 1008.5
1 0 0 48 0 188742.96 1 0 1 0 0 17.0 26.2 12.6 4.8 7.1 33.0 18.3 100.0 8 1007.7 23.5 59 4.0 1005.2
2 0 0 48 0 199281.07 1 1 1 0 0 16.0 18.6 2.6 3.8 0.0 41.0 16.2 98.0 8 1010.0 18.2 82 8.0 1011.0
3 0 0 48 0 207680.91 1 1 1 0 0 15.9 19.1 11.2 1.0 0.0 35.0 17.2 96.0 8 1012.5 18.2 82 8.0 1013.3
4 0 0 48 0 201497.24 1 1 1 0 0 13.7 19.2 1.2 1.0 3.2 35.0 15.2 72.0 7 1020.0 18.1 63 7.0 1020.0
In [44]:
import pandas as pd
# display correlation metric
cormat = merged.corr()
round(cormat,2)
Out[44]:
Number Of Extreme Number Of High Number Of Low Number Of Medium Daily Use Month Weekday Summer Autumn Winter Minimum Temperature (°C) Maximum Temperature (°C) Rainfall (Mm) Evaporation (Mm) Sunshine (Hours) Speed Of Maximum Wind Gust (Km/H) 9Am Temperature (°C) 9Am Relative Humidity (%) 9Am Cloud Amount (Oktas) 9Am Msl Pressure (Hpa) 3Pm Temperature (°C) 3Pm Relative Humidity (%) 3Pm Cloud Amount (Oktas) 3Pm Msl Pressure (Hpa)
Number Of Extreme 1.00 0.67 -0.50 0.17 0.34 0.24 0.03 -0.20 -0.05 0.22 -0.33 -0.29 0.02 -0.29 -0.15 -0.29 -0.34 0.22 0.04 0.15 -0.28 0.13 0.00 0.16
Number Of High 0.67 1.00 -0.75 0.42 0.54 0.39 0.11 -0.32 -0.17 0.45 -0.48 -0.44 -0.03 -0.41 -0.23 -0.24 -0.48 0.30 -0.01 0.12 -0.43 0.21 0.06 0.12
Number Of Low -0.50 -0.75 1.00 -0.91 -0.65 -0.66 -0.07 0.54 0.14 -0.62 0.65 0.60 -0.01 0.59 0.34 0.24 0.63 -0.39 0.03 -0.23 0.60 -0.28 -0.11 -0.22
Number Of Medium 0.17 0.42 -0.91 1.00 0.56 0.66 0.04 -0.54 -0.11 0.59 -0.60 -0.56 0.02 -0.55 -0.32 -0.16 -0.57 0.34 -0.05 0.23 -0.55 0.25 0.13 0.21
Daily Use 0.34 0.54 -0.65 0.56 1.00 0.52 0.46 -0.33 -0.25 0.54 -0.46 -0.34 -0.06 -0.28 -0.30 0.09 -0.38 0.17 0.00 -0.02 -0.36 0.19 0.18 -0.05
Month 0.24 0.39 -0.66 0.66 0.52 1.00 -0.01 -0.75 -0.18 0.84 -0.75 -0.69 -0.03 -0.56 -0.27 0.14 -0.73 0.23 -0.12 0.11 -0.67 0.12 0.10 0.08
Weekday 0.03 0.11 -0.07 0.04 0.46 -0.01 1.00 -0.01 0.02 -0.01 0.00 -0.03 -0.07 -0.04 -0.08 0.08 0.01 0.01 0.13 0.01 -0.01 0.02 0.06 -0.02
Summer -0.20 -0.32 0.54 -0.54 -0.33 -0.75 -0.01 1.00 -0.45 -0.44 0.61 0.58 0.01 0.59 0.32 0.04 0.61 -0.29 0.10 -0.23 0.56 -0.17 -0.10 -0.22
Autumn -0.05 -0.17 0.14 -0.11 -0.25 -0.18 0.02 -0.45 1.00 -0.61 0.09 0.09 0.02 -0.08 -0.07 -0.25 0.08 0.03 -0.04 0.27 0.09 -0.02 0.00 0.29
Winter 0.22 0.45 -0.62 0.59 0.54 0.84 -0.01 -0.44 -0.61 1.00 -0.64 -0.60 -0.03 -0.43 -0.22 0.22 -0.62 0.23 -0.05 -0.07 -0.59 0.17 0.09 -0.09
Minimum Temperature (°C) -0.33 -0.48 0.65 -0.60 -0.46 -0.75 0.00 0.61 0.09 -0.64 1.00 0.70 0.04 0.66 0.10 0.03 0.91 -0.32 0.19 -0.32 0.66 -0.05 0.05 -0.28
Maximum Temperature (°C) -0.29 -0.44 0.60 -0.56 -0.34 -0.69 -0.03 0.58 0.09 -0.60 0.70 1.00 -0.13 0.63 0.48 -0.07 0.82 -0.31 -0.19 -0.09 0.97 -0.46 -0.25 -0.18
Rainfall (Mm) 0.02 -0.03 -0.01 0.02 -0.06 -0.03 -0.07 0.01 0.02 -0.03 0.04 -0.13 1.00 -0.05 -0.15 0.04 -0.03 0.20 0.07 -0.14 -0.13 0.19 0.11 -0.06
Evaporation (Mm) -0.29 -0.41 0.59 -0.55 -0.28 -0.56 -0.04 0.59 -0.08 -0.43 0.66 0.63 -0.05 1.00 0.29 0.14 0.71 -0.52 0.00 -0.27 0.56 -0.21 -0.10 -0.25
Sunshine (Hours) -0.15 -0.23 0.34 -0.32 -0.30 -0.27 -0.08 0.32 -0.07 -0.22 0.10 0.48 -0.15 0.29 1.00 -0.04 0.22 -0.26 -0.59 0.21 0.50 -0.56 -0.72 0.14
Speed Of Maximum Wind Gust (Km/H) -0.29 -0.24 0.24 -0.16 0.09 0.14 0.08 0.04 -0.25 0.22 0.03 -0.07 0.04 0.14 -0.04 1.00 0.08 -0.37 0.00 -0.42 -0.12 -0.08 0.13 -0.43
9Am Temperature (°C) -0.34 -0.48 0.63 -0.57 -0.38 -0.73 0.01 0.61 0.08 -0.62 0.91 0.82 -0.03 0.71 0.22 0.08 1.00 -0.46 0.05 -0.27 0.76 -0.16 -0.01 -0.28
9Am Relative Humidity (%) 0.22 0.30 -0.39 0.34 0.17 0.23 0.01 -0.29 0.03 0.23 -0.32 -0.31 0.20 -0.52 -0.26 -0.37 -0.46 1.00 0.08 0.13 -0.26 0.40 0.04 0.14
9Am Cloud Amount (Oktas) 0.04 -0.01 0.03 -0.05 0.00 -0.12 0.13 0.10 -0.04 -0.05 0.19 -0.19 0.07 0.00 -0.59 0.00 0.05 0.08 1.00 -0.19 -0.20 0.43 0.39 -0.12
9Am Msl Pressure (Hpa) 0.15 0.12 -0.23 0.23 -0.02 0.11 0.01 -0.23 0.27 -0.07 -0.32 -0.09 -0.14 -0.27 0.21 -0.42 -0.27 0.13 -0.19 1.00 -0.03 -0.14 -0.29 0.96
3Pm Temperature (°C) -0.28 -0.43 0.60 -0.55 -0.36 -0.67 -0.01 0.56 0.09 -0.59 0.66 0.97 -0.13 0.56 0.50 -0.12 0.76 -0.26 -0.20 -0.03 1.00 -0.55 -0.28 -0.13
3Pm Relative Humidity (%) 0.13 0.21 -0.28 0.25 0.19 0.12 0.02 -0.17 -0.02 0.17 -0.05 -0.46 0.19 -0.21 -0.56 -0.08 -0.16 0.40 0.43 -0.14 -0.55 1.00 0.35 -0.03
3Pm Cloud Amount (Oktas) 0.00 0.06 -0.11 0.13 0.18 0.10 0.06 -0.10 0.00 0.09 0.05 -0.25 0.11 -0.10 -0.72 0.13 -0.01 0.04 0.39 -0.29 -0.28 0.35 1.00 -0.23
3Pm Msl Pressure (Hpa) 0.16 0.12 -0.22 0.21 -0.05 0.08 -0.02 -0.22 0.29 -0.09 -0.28 -0.18 -0.06 -0.25 0.14 -0.43 -0.28 0.14 -0.12 0.96 -0.13 -0.03 -0.23 1.00

In our approach to feature selection, we employed the correlation coefficient method to determine the linear (or rank) correlation between the target variable and each individual feature. Features with the highest absolute correlations are the most relevant candidates for our analysis.

Applying the correlation metric outlined above, we set an absolute-value threshold of 0.3. Consequently, the following features are deemed less significant for our regression model predicting energy usage: 'Rainfall (mm)', 'Evaporation (mm)', 'Speed of Maximum Wind Gust (km/h)', '9AM Relative Humidity (%)', '9AM Cloud Amount (oktas)', '9AM MSL Pressure (hPa)', '3PM Relative Humidity (%)', '3PM Cloud Amount (oktas)', and '3PM MSL Pressure (hPa)'.

It should be noted that a multicollinearity concern has been identified: substantial correlations exist among several predictor variables, particularly within groups that share similar categories, such as the temperature, price-category, and time-related variables. To optimize the model, we remove some of these correlated predictors in the next section for better performance and interpretability.
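The thresholding described above can be sketched as follows. This is a minimal example on synthetic data; `df` and its column names are hypothetical stand-ins for the merged dataset, not the actual assignment data:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the merged dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({'Daily Use': rng.normal(size=100)})
df['Max Temp'] = 0.8 * df['Daily Use'] + rng.normal(scale=0.5, size=100)  # related
df['Rainfall'] = rng.normal(size=100)                                     # pure noise

# absolute Pearson correlation of each candidate feature with the target
corr = df.drop(columns='Daily Use').corrwith(df['Daily Use']).abs()

# keep only features above the 0.3 threshold used in this report
selected = corr[corr > 0.3].index.tolist()
print(selected)
```

The same one-liner applied to the real merged dataframe reproduces the feature list quoted above.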

3.3 Creating new features¶

In [45]:
merged_1 = merged.copy()

# create cubic polynomial features from the temperature columns
merged_1["Min Temp Cubed"] = merged_1["Minimum Temperature (°C)"] ** 3
merged_1["Max Temp Cubed"] = merged_1["Maximum Temperature (°C)"] ** 3
merged_1["9am Temp Cubed"] = merged_1["9Am Temperature (°C)"] ** 3
merged_1["3pm Temp Cubed"] = merged_1["3Pm Temperature (°C)"] ** 3
In [46]:
# create interaction features between Month and temperature
merged_1['Month:MinTem'] = merged_1['Month'] * merged_1['Minimum Temperature (°C)']
merged_1['Month:MaxTem'] = merged_1['Month'] * merged_1['Maximum Temperature (°C)']

To improve the predictive power of the regression model, we created two types of new features:

1 Polynomial terms: To better represent the non-linear V-shaped relationship between energy use and the temperature-related variables, we generated cubic terms for four predictors: 'Min Temp Cubed', 'Max Temp Cubed', '9am Temp Cubed', and '3pm Temp Cubed'.

2 Interaction terms: To account for the relationship between energy demand and temperature fluctuations across seasons, we combined 'Month' with the maximum/minimum temperature. This produced two new features, 'Month:MinTem' and 'Month:MaxTem', which capture the varying impact of temperature on energy usage depending on the month.

In [47]:
#display data
merged_1.head()
Out[47]:
Number Of Extreme Number Of High Number Of Low Number Of Medium Daily Use Month Weekday Summer Autumn Winter Minimum Temperature (°C) Maximum Temperature (°C) Rainfall (Mm) Evaporation (Mm) Sunshine (Hours) Speed Of Maximum Wind Gust (Km/H) 9Am Temperature (°C) 9Am Relative Humidity (%) 9Am Cloud Amount (Oktas) 9Am Msl Pressure (Hpa) 3Pm Temperature (°C) 3Pm Relative Humidity (%) 3Pm Cloud Amount (Oktas) 3Pm Msl Pressure (Hpa) Min Temp Cubed Max Temp Cubed 9am Temp Cubed 3pm Temp Cubed Month:MinTem Month:MaxTem
0 0 0 48 0 197990.13 1 0 1 0 0 18.4 29.0 0.0 9.4 1.3 30.0 23.3 52.0 7 1013.3 28.7 38 7.0 1008.5 6229.504 24389.000 12649.337 23639.903 18.4 29.0
1 0 0 48 0 188742.96 1 0 1 0 0 17.0 26.2 12.6 4.8 7.1 33.0 18.3 100.0 8 1007.7 23.5 59 4.0 1005.2 4913.000 17984.728 6128.487 12977.875 17.0 26.2
2 0 0 48 0 199281.07 1 1 1 0 0 16.0 18.6 2.6 3.8 0.0 41.0 16.2 98.0 8 1010.0 18.2 82 8.0 1011.0 4096.000 6434.856 4251.528 6028.568 16.0 18.6
3 0 0 48 0 207680.91 1 1 1 0 0 15.9 19.1 11.2 1.0 0.0 35.0 17.2 96.0 8 1012.5 18.2 82 8.0 1013.3 4019.679 6967.871 5088.448 6028.568 15.9 19.1
4 0 0 48 0 201497.24 1 1 1 0 0 13.7 19.2 1.2 1.0 3.2 35.0 15.2 72.0 7 1020.0 18.1 63 7.0 1020.0 2571.353 7077.888 3511.808 5929.741 13.7 19.2
In [48]:
#check dimension
rows, columns = merged_1.shape

data6 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table4 = pd.DataFrame(data6)

display(Markdown("#### Dimensions of merged_1"))
display(dimensions_table4)

Dimensions of merged_1¶

Rows Columns
0 236 30

The dataset 'merged_1' now has 236 observations and 30 variables, with no missing values.

Section 4: COMPUTING THE REGRESSION MODELS¶

Our process of computing the regression models includes the following steps:

1 Split the dataset: divide the data into training and testing subsets for model building and evaluation.

2 Set random seeds: ensure consistent and reproducible results by setting random seeds before training.

3 Standardize features: scale and standardize the features for better model performance.

4 Select features: use SelectKBest or the correlation coefficient method to identify and include the most important features in the regression models.

5 Use cross-validation: apply 5-fold cross-validation to reduce bias and obtain reliable performance estimates.

6 Evaluate error metrics: assess model performance with RMSE, MAE, and MAPE for different feature counts (K).

7 Examine statistics: analyse the statistical summary, focusing on coefficients, p-values, and goodness-of-fit metrics.

8 Compare models: review the regression models' performance and fit to select the most appropriate model.

4.1 Computing the first model (model_0)¶

In [49]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression 
from IPython.display import display

# set up function to calculate MAPE
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# set up dependent and independent variables
columns_to_drop = ['Daily Use']
X0 = merged.drop(columns_to_drop, axis=1)
y0 = merged['Daily Use']

# standardize the features except 'Weekday'
scaler = StandardScaler()
X0_scaled = X0.copy()
X0_scaled.loc[:, X0_scaled.columns != 'Weekday'] = scaler.fit_transform(X0.loc[:, X0.columns != 'Weekday'])

results0 = []

#loop out the error metric for k in selected range
for k in range(1, 24):
    selector = SelectKBest(score_func=f_regression, k=k)
    X0_best_features = selector.fit_transform(X0_scaled, y0)
    
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    lm = LinearRegression()

    rmse_scores0 = []
    mae_scores0 = []
    mape_scores0 = []
    r2_scores0 = []
    adjusted_r2_scores0 = []

    for train_index, test_index in kf.split(X0_best_features):
        X_train0, X_test0= X0_best_features[train_index], X0_best_features[test_index]
        y_train0, y_test0 = y0.iloc[train_index], y0.iloc[test_index]

        model_0 = lm.fit(X_train0, y_train0)
        y_pred0 = lm.predict(X_test0)

        rmse0 = np.sqrt(mean_squared_error(y_test0, y_pred0))
        mae0 = mean_absolute_error(y_test0, y_pred0)
        mape0 = mean_absolute_percentage_error(y_test0, y_pred0)
     
        
        rmse_scores0.append(rmse0)
        mae_scores0.append(mae0)
        mape_scores0.append(mape0)
      
                                                               
                                                               
    avg_rmse0 = np.mean(rmse_scores0)
    avg_mae0 = np.mean(mae_scores0)
    avg_mape0 = np.mean(mape_scores0)
  

    results0.append([k, avg_rmse0, avg_mae0, avg_mape0])

# create a DataFrame with the results
columns = ['K', 'Average RMSE', 'Average MAE', 'Average MAPE']
results_df0 = pd.DataFrame(results0, columns=columns)

# display the results as a table
display(Markdown("#### Model_0: Error Metric with K"))
display(results_df0)

Model_0: Error Metric with K¶

K Average RMSE Average MAE Average MAPE
0 1 19827.694309 15755.788963 6.785999
1 2 19812.484195 15845.667944 6.825048
2 3 19847.301024 15934.037434 6.863746
3 4 19337.849251 15222.538257 6.572814
4 5 19450.204539 15231.908482 6.576412
5 6 15675.625110 11906.911437 5.087793
6 7 15775.166358 11981.998545 5.115063
7 8 15376.484923 11987.946208 5.116251
8 9 15351.687733 11969.679931 5.108056
9 10 15351.687733 11969.679931 5.108056
10 11 15055.655637 11773.906589 5.017770
11 12 15063.412122 11834.740366 5.054922
12 13 14494.496705 11334.305675 4.842552
13 14 14368.984700 11406.994907 4.870260
14 15 14364.559931 11394.036569 4.866781
15 16 14559.703465 11438.526684 4.874333
16 17 14574.710206 11496.693956 4.899573
17 18 14589.988414 11525.274231 4.910701
18 19 14222.189128 11108.421988 4.739425
19 20 14232.893748 11119.996322 4.744972
20 21 14334.217792 11224.352618 4.788702
21 22 14304.313767 11278.616569 4.815530
22 23 14275.852917 11258.095010 4.809062

Initially, we executed model_0 using the variables without polynomial and interaction terms, employing SelectKBest with the f_regression score for feature selection. Retaining only the predictors most associated with the target helps to reduce overfitting.

Then we identified the optimal K value by evaluating the error metrics, which included the average Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).

It becomes evident that the most favourable model is achieved with K = 19.
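Picking the best K programmatically amounts to locating the row with the smallest average error. The sketch below uses a small hypothetical table shaped like the results above (three rows only), not the full results dataframe:

```python
import pandas as pd

# hypothetical excerpt shaped like results_df0 above
results_df0 = pd.DataFrame({
    'K': [18, 19, 20],
    'Average RMSE': [14589.99, 14222.19, 14232.89],
    'Average MAE': [11525.27, 11108.42, 11120.00],
    'Average MAPE': [4.91, 4.74, 4.74],
})

# the row minimising average RMSE gives the preferred K
best_row = results_df0.loc[results_df0['Average RMSE'].idxmin()]
best_k = int(best_row['K'])
print(best_k)  # 19 for this excerpt
```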

In this case, we keep using SelectKBest rather than the correlation coefficient method for the feature selection of model_2, since our primary goal is to create a predictive model and we are less concerned with interpreting the relationship between the original predictors and the response variable.

4.2 Computing the second model (model_2)¶

In [50]:
# set up function to calculate MAPE
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


# set up dependent and independent variables
columns_to_drop = ['Daily Use']
X2 = merged_1.drop(columns_to_drop, axis=1)
y2 = merged_1['Daily Use']

# Standardize the features except 'Weekday'
scaler = StandardScaler()
X2_scaled = X2.copy()
X2_scaled.loc[:, X2_scaled.columns != 'Weekday'] = scaler.fit_transform(X2.loc[:, X2.columns != 'Weekday'])


results2 = []

for k in range(1, 30):
    selector = SelectKBest(score_func=f_regression, k=k)
    X2_best_features = selector.fit_transform(X2_scaled, y2)
    
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    lm = LinearRegression()

    rmse_scores2 = []
    mae_scores2 = []
    mape_scores2 = []

    for train_index, test_index in kf.split(X2_best_features):
        X_train2, X_test2 = X2_best_features[train_index], X2_best_features[test_index]
        y_train2, y_test2 = y2.iloc[train_index], y2.iloc[test_index]

        model_2 = lm.fit(X_train2, y_train2)
        y_pred2 = lm.predict(X_test2)

        rmse2 = np.sqrt(mean_squared_error(y_test2, y_pred2))
        mae2 = mean_absolute_error(y_test2, y_pred2)
        mape2 = mean_absolute_percentage_error(y_test2, y_pred2)

        rmse_scores2.append(rmse2)
        mae_scores2.append(mae2)
        mape_scores2.append(mape2)

    avg_rmse2 = np.mean(rmse_scores2)
    avg_mae2 = np.mean(mae_scores2)
    avg_mape2 = np.mean(mape_scores2)

    results2.append([k, avg_rmse2, avg_mae2, avg_mape2])

# Create a DataFrame with the results
columns = ['K', 'Average RMSE', 'Average MAE', 'Average MAPE']
results_df2 = pd.DataFrame(results2, columns=columns)

# Display the results as a table
display(Markdown("#### Model_2: Error Metric with K"))
display(results_df2)

Model_2: Error Metric with K¶

K Average RMSE Average MAE Average MAPE
0 1 19827.694309 15755.788963 6.785999
1 2 19812.484195 15845.667944 6.825048
2 3 19847.301024 15934.037434 6.863746
3 4 19337.849251 15222.538257 6.572814
4 5 19450.204539 15231.908482 6.576412
5 6 15675.625110 11906.911437 5.087793
6 7 15775.166358 11981.998545 5.115063
7 8 15376.484923 11987.946208 5.116251
8 9 15351.687733 11969.679931 5.108056
9 10 15351.687733 11969.679931 5.108056
10 11 15055.655637 11773.906589 5.017770
11 12 11986.739358 9368.973404 4.005628
12 13 11890.785749 9188.517916 3.930748
13 14 11189.916314 8630.867520 3.677548
14 15 10884.574017 8438.629805 3.590851
15 16 10901.983584 8459.681882 3.600329
16 17 10904.313575 8447.304293 3.597428
17 18 10608.828021 8068.359296 3.433803
18 19 10481.079928 8029.774965 3.417644
19 20 10470.849285 8032.238976 3.418831
20 21 9688.956847 7460.399061 3.197219
21 22 9687.305503 7496.259981 3.211278
22 23 9681.640492 7523.572427 3.219184
23 24 9822.021516 7518.552680 3.210974
24 25 9695.099291 7540.174332 3.228786
25 26 9693.578367 7562.003919 3.241291
26 27 9841.822150 7685.961912 3.292416
27 28 9800.718967 7681.699673 3.297598
28 29 9696.357201 7519.900861 3.224159

In Model 2, we have incorporated additional features, which comprise both polynomial and interaction terms, to enhance the analytical capabilities of the model.

It becomes evident that the most favourable model is achieved with K = 21.


4.3 Comparison: model_0 and model_2¶

In [51]:
# create a table 
data = {'Model': ['Model_2', 'Model_0'],
        'RMSE': [9688.956847, 14222.189128],
        'MAE': [7460.399061,11108.421988],
        'MAPE': [3.197219, 4.739425]}

errormetric_compare = pd.DataFrame(data)
display(Markdown("#### Error Metric:  Model_0 and Model_2"))
display(errormetric_compare)

Error Metric: Model_0 and Model_2¶

Model RMSE MAE MAPE
0 Model_2 9688.956847 7460.399061 3.197219
1 Model_0 14222.189128 11108.421988 4.739425

Since model_0 and model_2 were evaluated via cross-validation across many values of K rather than as a single fitted model, we did not generate a statistical summary for them to compare statistics such as R-squared, AIC, BIC, or coefficients. To compare these two models, we focus on error metrics only.

As the table above shows, the predictive power improves significantly in model_2 after adding the interaction and polynomial terms.

Conclusion:

Model_2 has lower error-metric values across the board, which suggests that it is a better model in terms of predictive accuracy than Model_0.

4.4 Computing third model (model_3) using correlation coefficient for feature selection¶

In [52]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

#define selected predictors
columns = [ 'Number Of Low', 'Number Of Medium', 'Month', 'Weekday',
            'Minimum Temperature (°C)', 'Speed Of Maximum Wind Gust (Km/H)',
            '3Pm Relative Humidity (%)', 'Min Temp Cubed', 'Max Temp Cubed',
            '3pm Temp Cubed', 'Month:MinTem', 'Month:MaxTem']

#define independent and dependent variables
X3 = merged_1[columns]
y3 = merged_1['Daily Use']

# standardize the features except 'Weekday'
scaler = StandardScaler()
X3_scaled = X3.copy()
X3_scaled.loc[:, X3_scaled.columns != 'Weekday'] = scaler.fit_transform(X3.loc[:, X3.columns != 'Weekday'])

X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X3_scaled, y3, test_size=0.2, random_state=42)  
    
X_train_const_3 = sm.add_constant(X_train_3)

# set up the OLS model
model_3 = sm.OLS(y_train_3, X_train_const_3)

# fit the model
fitted_model_3 = model_3.fit()

# get the statistical summary
statistical_summary_3 = fitted_model_3.summary()
display(Markdown("#### Model_3: Statistical Summary")) 
print(statistical_summary_3)

Model_3: Statistical Summary¶

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Daily Use   R-squared:                       0.881
Model:                            OLS   Adj. R-squared:                  0.873
Method:                 Least Squares   F-statistic:                     107.8
Date:                Sat, 15 Apr 2023   Prob (F-statistic):           3.96e-74
Time:                        13:34:47   Log-Likelihood:                -1977.7
No. Observations:                 188   AIC:                             3981.
Df Residuals:                     175   BIC:                             4024.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
=====================================================================================================
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const                              2.187e+05   1256.167    174.068      0.000    2.16e+05    2.21e+05
Number Of Low                     -1.017e+04   1992.000     -5.103      0.000   -1.41e+04   -6234.009
Number Of Medium                  -3238.4336   1742.153     -1.859      0.065   -6676.769     199.902
Month                              1.824e+04   4372.439      4.170      0.000    9605.728    2.69e+04
Weekday                            2.446e+04   1521.511     16.076      0.000    2.15e+04    2.75e+04
Minimum Temperature (°C)          -3.851e+04   5727.733     -6.723      0.000   -4.98e+04   -2.72e+04
Speed Of Maximum Wind Gust (Km/H)  1663.7720    855.265      1.945      0.053     -24.190    3351.734
3Pm Relative Humidity (%)          3948.8279    982.389      4.020      0.000    2009.973    5887.683
Min Temp Cubed                     2.259e+04   3435.715      6.575      0.000    1.58e+04    2.94e+04
Max Temp Cubed                     1.036e+04   2170.415      4.773      0.000    6075.859    1.46e+04
3pm Temp Cubed                     9023.1650   2379.189      3.793      0.000    4327.568    1.37e+04
Month:MinTem                        1.66e+04   3380.132      4.912      0.000    9933.299    2.33e+04
Month:MaxTem                      -2.664e+04   2622.478    -10.160      0.000   -3.18e+04   -2.15e+04
==============================================================================
Omnibus:                       14.099   Durbin-Watson:                   1.882
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               19.195
Skew:                          -0.484   Prob(JB):                     6.79e-05
Kurtosis:                       4.231   Cond. No.                         28.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

To identify the best model using correlation-coefficient feature selection, we followed an approach consisting of the following steps:

1 We removed variables with an absolute value of correlation coefficient less than 0.3 to focus on the most relevant features.

2 To reduce the impact of multicollinearity on the model, we removed some of the highly correlated predictors.

3 We ran a linear regression model and examined the statistical summary to assess the significance of the features. Based on the coefficients and p-values derived from the statistical summary, we made adjustments to the features and computed a revised model.

Model_3 allows us to make accurate predictions and draw meaningful insights from the data.

In analyzing model_3, we have focused on several key statistics.

R-squared and adjusted R-squared:

The R-squared value is 0.881, meaning the model explains 88.1% of the variance in the dependent variable. The adjusted R-squared value is 0.873; it accounts for the number of independent variables and therefore gives a fairer measure of model fit.

F-statistic:

The F-statistic is 107.8, and the associated probability is very small (3.96e-74), which indicates that the model is statistically significant.

Coefficients:

The coefficients show the effect of each independent variable on the dependent variable, holding all other variables constant. Because the features were standardized, a one-standard-deviation increase in "Number Of Low" is associated with a decrease in "Daily Use" of about 10,170 units, assuming all other variables in the model remain constant.

P-values:

All variables are statistically significant at the 95% level, except "Number Of Medium" and "Speed Of Maximum Wind Gust (Km/H)", whose p-values of 0.065 and 0.053 are marginally above 0.05.

Multicollinearity:

The condition number is 28.1, indicating no serious multicollinearity issues.
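Variance inflation factors (VIFs) offer a complementary check on multicollinearity beyond the condition number. The sketch below uses statsmodels on a small synthetic frame (the columns `a`, `b`, `c` are illustrative, not from the assignment data), where `b` is built to be nearly collinear with `a`:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame({'a': rng.normal(size=200)})
X['b'] = X['a'] * 0.9 + rng.normal(scale=0.1, size=200)  # nearly collinear with 'a'
X['c'] = rng.normal(size=200)                             # independent noise

# VIF of each column: how well it is explained by the other columns
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.round(1))  # 'a' and 'b' inflate well above the common cut-off of 5
```

Running the same loop over `X3_scaled` would flag any remaining collinear pairs among the 12 retained predictors.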

In [53]:
# define the mean_absolute_percentage_error function
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# prepare the data
columns = [ 'Number Of Low', 'Number Of Medium', 'Month', 'Weekday',
       'Minimum Temperature (°C)', 'Speed Of Maximum Wind Gust (Km/H)',
       '3Pm Relative Humidity (%)', 'Min Temp Cubed', 'Max Temp Cubed',
       '3pm Temp Cubed', 'Month:MinTem', 'Month:MaxTem' ]

X3 = merged_1[columns]
y3 = merged_1['Daily Use']

# standardize the features except 'Weekday'
scaler = StandardScaler()
X3_scaled = X3.copy()
X3_scaled.loc[:, X3_scaled.columns != 'Weekday'] = scaler.fit_transform(X3.loc[:, X3.columns != 'Weekday'])


# split the data using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
lm = LinearRegression()

rmse_scores3 = []
mae_scores3 = []
mape_scores3 = []

for train_index, test_index in kf.split(X3_scaled):
    X_train3, X_test3 = X3_scaled.iloc[train_index], X3_scaled.iloc[test_index]
    y_train3, y_test3 = y3.iloc[train_index], y3.iloc[test_index]
    
    model_3 = lm.fit(X_train3, y_train3)
    y_pred3 = lm.predict(X_test3)
    
    rmse3 = np.sqrt(mean_squared_error(y_test3, y_pred3))
    mae3 = mean_absolute_error(y_test3, y_pred3)
    mape3 = mean_absolute_percentage_error(y_test3, y_pred3)
    
    rmse_scores3.append(rmse3)
    mae_scores3.append(mae3)
    mape_scores3.append(mape3)
    
  

# average the error metrics across the five folds
avg_rmse3 = np.mean(rmse_scores3)
avg_mae3 = np.mean(mae_scores3)
avg_mape3 = np.mean(mape_scores3)

# create a table for error metrics
error_table = pd.DataFrame({
    'Fold': range(1, 6),
    'RMSE': rmse_scores3,
    'MAE': mae_scores3,
    'MAPE': mape_scores3})
display(Markdown("#### Error Metric by Fold: Model_3"))
display(error_table)

# create a table for average error metrics
avg_error_table = pd.DataFrame({
    'Error Metric': ['RMSE', 'MAE', 'MAPE'],
    'Average Error': [avg_rmse3, avg_mae3, avg_mape3]})
display(Markdown("#### Error Metric By Average: Model_3"))
display(avg_error_table)

Error Metric by Fold: Model_3¶

Fold RMSE MAE MAPE
0 1 9607.062367 7727.804152 3.224131
1 2 8780.053862 7651.410321 3.366191
2 3 10555.217055 8612.614501 3.677276
3 4 7259.749627 5529.940879 2.419758
4 5 11274.842725 8059.533868 3.467748

Error Metric By Average: Model_3¶

Error Metric Average Error
0 RMSE 9495.385127
1 MAE 7516.260744
2 MAPE 3.231021
In [54]:
from sklearn import linear_model

#define variables
X5 = X3_scaled
y5 = merged_1['Daily Use']

# set up training and testing data
X_train5, X_test5, y_train5, y_test5 = train_test_split(X5, y5, test_size=0.2, random_state=42)

lm = linear_model.LinearRegression()
model5 = lm.fit(X_train5, y_train5)
y_test_predictions = lm.predict(X_test5)

df = pd.DataFrame({'Actual': y_test5[:10], 'Predicted': y_test_predictions[:10]})
display(Markdown("#### Prediction vs. Actual Value: Model_3"))
display(df.round(2))

Prediction vs. Actual Value: Model_3¶

Actual Predicted
70 213975.93 208677.44
195 264775.90 258697.55
183 260611.16 246132.42
9 290620.38 297152.93
128 238268.73 239913.45
110 247411.63 239325.27
234 250985.58 272143.90
94 214600.56 223278.95
152 248586.68 251981.40
16 203849.71 214367.33

4.5 Model comparison¶

In [55]:
data = {'Model_0': {'RMSE': 14222.189128, 'MAE': 11108.421988, 'MAPE': 4.739425},
        'Model_2': {'RMSE': 9688.96, 'MAE': 7460.40, 'MAPE': 3.20},
        'Model_3': {'RMSE': 9495.38, 'MAE': 7516.26, 'MAPE': 3.23}}

# create a dataframe from the dictionary
df = pd.DataFrame.from_dict(data, orient='index')

# display the dataframe using display()
display(Markdown("#### Model Comparison: Error Metric"))
display(df)

Model Comparison: Error Metric¶

RMSE MAE MAPE
Model_0 14222.189128 11108.421988 4.739425
Model_2 9688.960000 7460.400000 3.200000
Model_3 9495.380000 7516.260000 3.230000

As we did not obtain statistical summaries for model_0 and model_2, we focus here on the error-metric comparison shown above.

In terms of error metrics, it is clear that model_2 and model_3, which include the polynomial and interaction terms, have much better predictive power than model_0.

It is worth noting that model_3 has slightly higher MAE and MAPE than model_2 but performs better on RMSE. Therefore, we consider model_3 and model_2 to have the same level of predictive power.

In conclusion, model_3 matches the predictive power of model_2 while using a reduced set of only 12 predictors.

We strongly recommend using model_3 as the forecasting model to predict daily maximum energy demand.

Section 5: KEY FINDINGS¶

5.1 The top 10 most important predictors¶

To identify the top 10 most important predictors (excluding the intercept), we can refer to the absolute t-statistic values in the regression results: the larger the absolute t-statistic, the more significant the predictor. Here are the top 10 predictors in descending order of importance, based on their t-statistics:

1 Weekday

2 Month:MaxTem

3 Minimum Temperature (°C)

4 Min Temp Cubed

5 Number Of Low

6 Max Temp Cubed

7 Month:MinTem

8 Month

9 3Pm Relative Humidity (%)

10 3pm Temp Cubed

Alternatively, since the features were standardized, we can compare the absolute values of the coefficients directly: the larger the absolute value of a coefficient, the stronger the effect of that predictor on the dependent variable.

Based on the provided regression results, the top 10 most important predictors, in descending order of the absolute value of coefficients, are:

1 Minimum Temperature (°C)

2 Month:MaxTem

3 Weekday

4 Min Temp Cubed

5 Month:MinTem

6 Month

7 Number Of Low

8 Max Temp Cubed

9 3pm Temp Cubed

10 3Pm Relative Humidity (%)

It is important to highlight that merely two predictors from the original dataset appear among the top 10 most significant variables. This underscores the effectiveness of the newly created features in enhancing the model's performance.

5.2 Weekend and weekday¶

It's interesting that energy consumption patterns vary between weekdays and weekends, with considerably lower usage on the weekends.

One possible reason is that many businesses and workplaces are closed on weekends, resulting in a decrease in commercial energy consumption.

During weekends, people tend to spend more time at home. This trend suggests that staying at home may be a more energy-efficient option.

5.3 Variables related to temperature¶

Our analysis revealed a V-shaped pattern between the temperature-related variables and 'Daily Use': energy demand tends to rise during periods of extreme temperature, whether hot or cold.

Furthermore, the association between temperature and energy consumption differs across seasons. In summer, elevated temperatures typically coincide with a surge in energy demand, while in winter, energy demand tends to rise as temperatures drop.

Section 6: LIMITATIONS AND RECOMMENDATIONS¶

6.1 Limitations¶

Incomplete data¶

Our dataset comprises only 8 months of observations, short of a complete annual cycle. This makes it difficult to capture the full seasonal trend of the variables, limiting our ability to make highly accurate predictions.

Limited data representativeness¶

Using data collected during the COVID-19 period to predict post-pandemic daily energy use may be problematic, because the data may not represent post-pandemic conditions.

The circumstances during the COVID-19 period might diverge significantly from those in a post-pandemic environment, including differences in energy policies, consumer energy usage patterns, and economic conditions.

The unique circumstances during the pandemic could limit the predictive power of the dataset, potentially leading to inaccurate or less reliable forecasts.

6.2 RECOMMENDATIONS¶

Enhancing the Model with New Features¶

Incorporating additional features into our model has the potential to enhance its predictive accuracy. Consider the following possibilities:

1 Numerical Energy Price Data: Instead of using price categories, incorporating precise energy price data may serve as a robust predictor, enabling the creation of a more accurate regression model.

2 'School Day' and 'Holiday' Variables: By introducing these dummy variables, we can better account for variations in energy consumption patterns. These variables complement the existing 'weekday' variable and offer a more comprehensive understanding of daily energy usage trends.
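Such dummies could be derived from the settlement date. Below is a sketch; the holiday set is a hypothetical illustration, not an official Victorian public-holiday calendar:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-01', '2021-01-04', '2021-01-26']))

# hypothetical public-holiday list; a real model would use an official calendar
holidays = {pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-26')}

df = pd.DataFrame({'Date': dates})
df['Holiday'] = df['Date'].isin(holidays).astype(int)          # 1 on listed holidays
df['Weekday'] = (df['Date'].dt.dayofweek < 5).astype(int)      # 1 Monday-Friday
print(df)
```

A 'School Day' dummy could be built the same way from term-date ranges.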

Collecting Up-to-Date Data for Energy Use Predictions¶

To enhance the accuracy of energy use predictions, it is crucial to gather data from the post-COVID era. Additionally, incorporating data spanning 2-3 years prior to the pandemic can serve as a valuable supplementary resource.

Exploring Alternative Models¶

To enhance the accuracy of our predictions, it is advisable to investigate more sophisticated regression models, including:

1 Regularization techniques, such as Lasso and Elastic Net, which help mitigate multicollinearity issues.

2 Ensemble tree-based models, such as Random Forest and XGBoost, with optimized parameters that have the potential to surpass the performance of our current models.
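A regularized variant could be slotted into the existing pipeline with only a model swap. The sketch below uses scikit-learn's LassoCV on synthetic data (the array shapes and coefficients are illustrative, not the assignment dataset):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 3 features are noise

X_scaled = StandardScaler().fit_transform(X)

# LassoCV chooses the regularisation strength by internal cross-validation
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)
print(lasso.alpha_)          # chosen penalty strength
print(lasso.coef_.round(2))  # uninformative features are shrunk towards zero
```

Because the L1 penalty drives irrelevant coefficients to zero, this would also double as an automatic feature-selection step, replacing the manual correlation threshold.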